Instance Based Table Integration Algorithm for Multilingual Tables on the Web
Abstract
We present an instance-based table integration algorithm. A table is a set of instances of a record, and a record consists of fields. A field is a pair of an attribute name and a sequence of attribute values of the same type. Given a set of tables, the algorithm calculates two numerical features for each field from the character codes of its values and then finds the correspondence between fields across tables. The novelty of the algorithm is that it uses the character code chart for the language in which the table contents are written. This allows a field to be represented by only two features. The algorithm requires neither attribute names nor an attribute value shared by all input tables. It is therefore suitable for tables obtained from Web data, as long as they are written in the same language. Applying the algorithm to real Web data written in several languages, we demonstrate that it yields accurate results and is robust to errors. The languages are Chinese, English, German, Japanese, and Korean.
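As a rough illustration of the idea, the following Python sketch computes two features per field from character codes and pairs up fields whose features are closest. The abstract does not say which two features the algorithm uses or how it matches fields, so the mean and standard deviation of Unicode code points, the greedy nearest-pair matching, and the names `field_features` and `match_fields` are all assumptions made for this example, not the authors' method.

```python
from math import hypot
from statistics import mean, pstdev


def field_features(values):
    """Two ASSUMED features of a field: mean and std dev of its code points."""
    codes = [ord(ch) for value in values for ch in value]
    if not codes:
        return (0.0, 0.0)
    return (mean(codes), pstdev(codes))


def match_fields(table_a, table_b):
    """Greedily pair columns of two tables by distance in feature space.

    Each table is a list of columns; each column is a list of strings.
    Attribute names are never consulted, only the attribute values.
    Returns a dict mapping column index in table_a to column index in table_b.
    """
    feats_a = [field_features(col) for col in table_a]
    feats_b = [field_features(col) for col in table_b]
    # All candidate pairs, closest first (Euclidean distance, an assumption).
    candidates = sorted(
        (hypot(fa[0] - fb[0], fa[1] - fb[1]), i, j)
        for i, fa in enumerate(feats_a)
        for j, fb in enumerate(feats_b)
    )
    used_a, used_b, mapping = set(), set(), {}
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            mapping[i] = j
            used_a.add(i)
            used_b.add(j)
    return mapping


# Two small tables whose columns appear in different orders.
table_a = [["Tokyo", "Osaka"], ["1200", "3400"]]
table_b = [["5600", "7800"], ["Kyoto", "Nagoya"]]
mapping = match_fields(table_a, table_b)
```

In this toy input the city-name column of `table_a` is paired with the city-name column of `table_b`, and the numeric columns are paired with each other, because digits and Latin letters occupy different ranges of the code chart.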
Similar Resources
Web-Scale Web Table to Knowledge Base Matching
Millions of relational HTML tables are found on the World Wide Web. In contrast to unstructured text, relational web tables provide a compact representation of entities described by attributes. The data within these tables covers a broad topical range. Web table data is used for question answering, augmentation of search results, and knowledge base completion. Until a few years ago, only search...
A Semantic Web Knowledge Base System that Supports Large Scale Data Integration
A true Semantic Web knowledge base system must scale both in terms of number of ontologies and quantity of data. It should also support reasoning using different points of view about the meanings and relationships of concepts and roles. We present our DLDB3 system that supports large scale data integration, and is provably sound and complete on a fragment of OWL DL when answering extensional co...
Detecting Tables in HTML Documents
Table is a commonly used presentation scheme, especially for describing relational information. Table understanding on the web has many potential applications including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although in HTML documents tables are generally marked as <table> elements, often the <table> tag is used liberally to ach...
Automating the extraction of data from HTML tables with unknown structure
Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. Our solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to find tables of inter...
Matching Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embeddings
Web tables constitute valuable sources of information for various applications, ranging from Web search to Knowledge Base (KB) augmentation. An underlying common requirement is to annotate the rows of Web tables with semantically rich descriptions of entities published in Web KBs. In this paper, we evaluate three unsupervised annotation methods: (a) a lookup-based method which relies on the min...